How to use Amazon Comprehend operations with the AWS SDK for Python (Boto3)
In this post, we'll walk through sample code that uses the main Amazon Comprehend functions and Topic Modeling with the AWS SDK for Python (Boto3).
The main functions
- DetectDominantLanguage
- DetectEntities
- DetectKeyPhrases
- DetectSentiment
These functions also have batch equivalents that process up to 25 documents per request, as sketched below.
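As a quick illustration, here is a minimal sketch of one batch variant (BatchDetectSentiment); the texts and region are placeholders:

import boto3

REGION = 'us-west-2'
comprehend = boto3.client('comprehend', region_name=REGION)

# Up to 25 documents can be passed per batch request.
texts = [
    "I love this service.",
    "The documentation could be better.",
]

response = comprehend.batch_detect_sentiment(TextList=texts, LanguageCode='en')

# Successful results arrive in ResultList; per-document failures in ErrorList.
for item in response['ResultList']:
    print(item['Index'], item['Sentiment'])
for error in response['ErrorList']:
    print(error['Index'], error['ErrorCode'], error['ErrorMessage'])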
Functions used in Topic Modeling
- StartTopicsDetectionJob
- DescribeTopicsDetectionJob
- ListTopicsDetectionJobs
Sample Code
Let's start by looking at the four main functions.
import boto3
import json

# Comprehend constant
REGION = 'us-west-2'


# Function for detecting the dominant language
def detect_dominant_language(text):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_dominant_language(Text=text)
    return response


# Function for detecting named entities
def detect_entities(text, language_code):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_entities(Text=text, LanguageCode=language_code)
    return response


# Function for detecting key phrases
def detect_key_phrases(text, language_code):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_key_phrases(Text=text, LanguageCode=language_code)
    return response


# Function for detecting sentiment
def detect_sentiment(text, language_code):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_sentiment(Text=text, LanguageCode=language_code)
    return response


def main():
    # text
    text = ("Amazon Comprehend is a natural language processing (NLP) service "
            "that uses machine learning to find insights and relationships in text.")
    # language code
    language_code = 'en'

    # detecting the dominant language
    result = detect_dominant_language(text)
    print("Starting detecting the dominant language")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting the dominant language\n")

    # detecting named entities
    result = detect_entities(text, language_code)
    print("Starting detecting named entities")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting named entities\n")

    # detecting key phrases
    result = detect_key_phrases(text, language_code)
    print("Starting detecting key phrases")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting key phrases\n")

    # detecting sentiment
    result = detect_sentiment(text, language_code)
    print("Starting detecting sentiment")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting sentiment\n")


if __name__ == '__main__':
    main()
DetectDominantLanguage operation
It detects the dominant language of the document. Amazon Comprehend can detect 101 different languages.
Execution result
{ "Languages": [ { "LanguageCode": "en", "Score": 0.9940536618232727 } ], "ResponseMetadata": { "HTTPHeaders": { "connection": "keep-alive", "content-length": "64", "content-type": "application/x-amz-json-1.1", "date": "Thu, 22 Mar 2018 04:15:20 GMT", "x-amzn-requestid": "a29fda00-2d87-11e8-ad56-************" }, "HTTPStatusCode": 200, "RequestId": "a29fda00-2d87-11e8-ad56-************", "RetryAttempts": 0 } }
DetectEntities operation
It detects entities, such as persons or places, in the document.
Execution result
{ "Entities": [ { "BeginOffset": 0, "EndOffset": 6, "Score": 0.8670787215232849, "Text": "Amazon", "Type": "ORGANIZATION" }, { "BeginOffset": 7, "EndOffset": 17, "Score": 1.0, "Text": "Comprehend", "Type": "COMMERCIAL_ITEM" } ], "ResponseMetadata": { "HTTPHeaders": { "connection": "keep-alive", "content-length": "201", "content-type": "application/x-amz-json-1.1", "date": "Thu, 22 Mar 2018 04:15:20 GMT", "x-amzn-requestid": "a2b84450-2d87-11e8-b3f9-************" }, "HTTPStatusCode": 200, "RequestId": "a2b84450-2d87-11e8-b3f9-************", "RetryAttempts": 0 } }
DetectKeyPhrases operation
It detects key phrases in the document contents.
Execution result
{ "KeyPhrases": [ { "BeginOffset": 0, "EndOffset": 17, "Score": 0.9958747029304504, "Text": "Amazon Comprehend" }, { "BeginOffset": 21, "EndOffset": 50, "Score": 0.9654422998428345, "Text": "a natural language processing" }, { "BeginOffset": 52, "EndOffset": 55, "Score": 0.941932201385498, "Text": "NLP" }, { "BeginOffset": 57, "EndOffset": 64, "Score": 0.9076098203659058, "Text": "service" }, { "BeginOffset": 75, "EndOffset": 91, "Score": 0.872683584690094, "Text": "machine learning" }, { "BeginOffset": 100, "EndOffset": 126, "Score": 0.9918361902236938, "Text": "insights and relationships" }, { "BeginOffset": 130, "EndOffset": 134, "Score": 0.998969554901123, "Text": "text" } ], "ResponseMetadata": { "HTTPHeaders": { "connection": "keep-alive", "content-length": "615", "content-type": "application/x-amz-json-1.1", "date": "Thu, 22 Mar 2018 04:15:21 GMT", "x-amzn-requestid": "a2d409a7-2d87-11e8-a9a6-************" }, "HTTPStatusCode": 200, "RequestId": "a2d409a7-2d87-11e8-a9a6-************", "RetryAttempts": 0 } }
DetectSentiment operation
It detects the sentiment (positive, negative, mixed, or neutral) of the document contents.
Execution result
{ "ResponseMetadata": { "HTTPHeaders": { "connection": "keep-alive", "content-length": "161", "content-type": "application/x-amz-json-1.1", "date": "Thu, 22 Mar 2018 04:15:21 GMT", "x-amzn-requestid": "a2ebb00b-2d87-11e8-9c58-************" }, "HTTPStatusCode": 200, "RequestId": "a2ebb00b-2d87-11e8-9c58-************", "RetryAttempts": 0 }, "Sentiment": "NEUTRAL", "SentimentScore": { "Mixed": 0.003294283989816904, "Negative": 0.01219215989112854, "Neutral": 0.7587229609489441, "Positive": 0.2257905900478363 } }
Topic Modeling
Let's try executing a topic detection job.
Sample Code
import boto3
import json
import time

from bson import json_util

# Comprehend constant
REGION = 'us-west-2'

# A low-level client representing Amazon Comprehend
comprehend = boto3.client('comprehend', region_name=REGION)

# Topics detection job settings
input_s3_url = "s3://your_input"
input_doc_format = "ONE_DOC_PER_FILE"  # or "ONE_DOC_PER_LINE"
output_s3_url = "s3://your_output"
data_access_role_arn = "arn:aws:iam::aws_account_id:role/role_name"
number_of_topics = 10
job_name = "Job_name"

input_data_config = {"S3Uri": input_s3_url, "InputFormat": input_doc_format}
output_data_config = {"S3Uri": output_s3_url}

# Starts an asynchronous topic detection job.
response = comprehend.start_topics_detection_job(
    NumberOfTopics=number_of_topics,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    DataAccessRoleArn=data_access_role_arn,
    JobName=job_name)

# Gets job_id
job_id = response["JobId"]
print('job_id: ' + job_id)

# Loops until JobStatus becomes 'COMPLETED' or 'FAILED'.
while True:
    result = comprehend.describe_topics_detection_job(JobId=job_id)
    job_status = result["TopicsDetectionJobProperties"]["JobStatus"]
    print("job_status: " + job_status)
    if job_status in ['COMPLETED', 'FAILED']:
        break
    time.sleep(60)

# You can get a list of the topic detection jobs that you have submitted.
filter_job_name = {"JobName": job_name}
topics_detection_job_list = comprehend.list_topics_detection_jobs(Filter=filter_job_name)
print('topics_detection_job_list: ' + json.dumps(
    topics_detection_job_list, sort_keys=True, indent=4, default=json_util.default))
StartTopicsDetectionJob
Starts a topic detection job as an asynchronous operation. After you get the JobId, you can check the job status using DescribeTopicsDetectionJob.
There are two kinds of InputFormat (illustrated in the sketch below):
- ONE_DOC_PER_FILE - Each file is treated as a single document.
- ONE_DOC_PER_LINE - Each line of a file is treated as a separate document.
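For example, the corresponding InputDataConfig values might look like this (the S3 URIs are placeholders):

one_doc_per_file = {
    "S3Uri": "s3://your_input/",            # each file is one document
    "InputFormat": "ONE_DOC_PER_FILE",
}
one_doc_per_line = {
    "S3Uri": "s3://your_input/corpus.txt",  # each line is one document
    "InputFormat": "ONE_DOC_PER_LINE",
}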
DescribeTopicsDetectionJob
Gets the status of a topic detection job. There are four possible statuses:
- JobStatus
- SUBMITTED
- IN_PROGRESS
- COMPLETED
- FAILED
In this sample code, we exit the while loop when JobStatus becomes COMPLETED or FAILED.
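When the job ends as FAILED, the describe response also carries a Message field explaining why. A minimal sketch, reusing comprehend and job_id from the sample above:

result = comprehend.describe_topics_detection_job(JobId=job_id)
props = result["TopicsDetectionJobProperties"]
if props["JobStatus"] == "FAILED":
    # Message is only present when Comprehend has something to report.
    print("Job failed:", props.get("Message", "no message provided"))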
ListTopicsDetectionJobs
Gets a list of the topic detection jobs that you have submitted.
Execution result
job_id: 2733262c2747153ab8cb0b01********
job_status: SUBMITTED
job_status: IN_PROGRESS
[...]
job_status: COMPLETED
topics_detection_job_list: {
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "415",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:27:59 GMT",
            "x-amzn-requestid": "669ffb28-2d89-11e8-82a0-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "669ffb28-2d89-11e8-82a0-************",
        "RetryAttempts": 0
    },
    "TopicsDetectionJobPropertiesList": [
        {
            "EndTime": {
                "$date": 1521692818930
            },
            "InputDataConfig": {
                "InputFormat": "ONE_DOC_PER_FILE",
                "S3Uri": "s3://your_input"
            },
            "JobId": "2733262c2747153ab8cb0b01********",
            "JobName": "Job4",
            "JobStatus": "COMPLETED",
            "NumberOfTopics": 10,
            "OutputDataConfig": {
                "S3Uri": "s3://your_output/**********-2733262c2747153ab8cb0b01********-1521692274392/output/output.tar.gz"
            },
            "SubmitTime": {
                "$date": 1521692274392
            }
        }
    ]
}
Check the output file
Check that a file has been created in the S3 bucket at the output destination. You can confirm the output location using ListTopicsDetectionJobs.
- OutputDataConfig
"OutputDataConfig": { "S3Uri": "s3://your_output_bucket/************-700e040bd7ae56714b65f56049f574d1-1521592942171/output/output.tar.gz" },
$ aws s3 cp s3://your_output_bucket/************-700e040bd7ae56714b65f56049f574d1-1521592942171/output/output.tar.gz .
$ tar -zxvf output.tar.gz
x topic-terms.csv
x doc-topics.csv
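If you prefer to stay in Python, a minimal sketch like the following downloads the archive with boto3; the S3 URI here is a placeholder, so substitute the S3Uri from your job's OutputDataConfig:

import boto3

s3_uri = "s3://your_output_bucket/path/to/output.tar.gz"  # placeholder
# Split "s3://bucket/key" into bucket and key.
bucket, _, key = s3_uri[len("s3://"):].partition("/")

s3 = boto3.client('s3', region_name='us-west-2')
s3.download_file(bucket, key, "output.tar.gz")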
- topic-terms.csv
  - Lists the topics in the collection. By default, each topic contains the top terms according to their topic weight.
- doc-topics.csv
  - Lists the documents associated with each topic and the proportion of each document that is concerned with the topic.

Note: To get the best results, use at least 1,000 documents with each topic modeling job.
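Once extracted, both files can be read with the standard csv module. A minimal sketch, assuming each file begins with a header row (topic,term,weight and docname,topic,proportion respectively):

import csv

with open("topic-terms.csv") as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the assumed header row
    for topic, term, weight in reader:
        print("topic {}: {} (weight={})".format(topic, term, weight))

with open("doc-topics.csv") as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the assumed header row
    for docname, topic, proportion in reader:
        print("{} -> topic {} (proportion={})".format(docname, topic, proportion))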
Conclusion
I've introduced sample code for each function using the AWS SDK for Python (Boto3), and I hope it helps you put the Amazon Comprehend API to use.